FASTQ sequence dataFASTQ sequence data!!! Coding style ~ 我有盡量檢查了,但是真的好多Q我怕我有沒有按照的地方,學姊可以幫我注意一下~ code, file suffix, value, parameter ‘file directory’ ‘operating system’ Normal software name
Read counts? Reads count?
RNA-Seq is a revolutionary approach to investigate and discover the transcriptome using next-generation sequencing technologies [ref]. Typically, this transcriptome analysis aims to identify genes differentially expressed among different conditions or tissues, resulting in the understanding of the important pathways that are associated with conditions.[ref]
RNASeqWorkflow is an user-friendly R-based tool for running RNA-Seq analysis pipeline including quality assessment, reads alignment and quantification, differential expression analysis, and functional analysis. The main features of this package are automated workflow, comprehensive report with data visualization and extendable file structure. In this package, new tuxedo pipeline published in Nature Protocols in 2016 can be fully implemented under R environment with extra functions such as reads quality assessment, trimming and functional analysis.
The following are main tools that are used in this package: ‘HISAT2’ for reads alignment(Kim et al. 2015); ‘StringTie’ for alignments assembly and transcripts quantification(Pertea et al. 2015); ‘Rsamtools’ for converting SAM files to BAM files(Morgan et al. 2018); ‘Gffcompare’ for comparing merged GTF file with reference GTF file; ‘systemPipeR’ package for quality assessment(Backman et al. 2016); ‘ShortRead’ package for quality trimming(Morgan et al. 2009); ‘ballgown’ package(Fu et al. 2018), ‘TPM & Student’s t-test’, ‘DESeq2’ package(Love et al. 2014) and ‘edgeR’ package(Robinson et al. 2010;McCarthy et al. 2012) for finding potential differential expressed genes; ‘clusterProfiler’ package(Yu et al. 2012) for Gene Ontology(GO) functional analysis and Kyoto Encyclopedia of Genes and Genomes(KEGG) pathway analysis.
The central concept behind this package is that each step involved in RNA-Seq data analysis is a function call in R. At the beginning, users will create a RNASeqWorkFlowParam S4 object by running RNASeqWorkflowParam() constructor function for all variable checking. After the creation of RNASeqWorkFlowParam, it will be used as input of the following analysis function.
RNASeqEnvironmentSet_CMD() or RNASeqEnvironmentSet(): to setup RNA-Seq environment.
RNASeqQualityAssessment_CMD() or RNASeqQualityAssessment(): (Optional) to run quality assessment step.
RNASeqQualityTrimming_CMD() or RNASeqQualityTrimming(): (Optional) to run quality trimming step.
RNASeqReadProcess_CMD() or RNASeqReadProcess(): to run reads alignment and quatification.
RNASeqDifferentialAnalysis_CMD() or RNASeqDifferentialAnalysis(): to run differential analysis via different R packages.
RNASeqGoKegg_CMD() or RNASeqGoKegg(): to conduct GO and KEGG analysis.
Functions with CMD suffix create an R script and run nohup R CMD BATCH script.R in background. Functions with no CMD suffix process in R shell. After running the above functions, the whole RNA-Seq analysis is done and generated files in each step will be stored in organized file directory. RNASeqWorkflow package makes two-group RNA-Seq analysis more efficient and easier for users.
Functions with CMD suffix will create an R script and run nohup R CMD BATCH script.R in background while functions with no CMD suffix will be processed in R shell. Files generated in each step will be kept in proper directory. Once the workflow is completed, a comprehensive RNA-Seq analysis is done. Additionally, this package is mainly designed for a two-group comparison setting, i.e.(??????) differential expression profile between two conditions.
Sample data used in this vignette can be downloaded from RNASeqWorkflowData experiment package. It was originated from NCBI’s Sequence Read Archive for the entries SRR3396381, SRR3396382, SRR3396384, SRR3396385, SRR3396386, and SRR3396387. These samples were from Saccharomyces cerevisiae. Suitable reference genome and gene annotation files for this species can be further downloaded from either iGenomes or Ensembl R64-1-1.(??????????????????) To create mini data for demonstration purpose, reads aligned to the region from 0 to 100000 at chromosome XV were extracted.
Necessary:
R version >= 3.5.0
Operating System: ‘Linux’ and ‘macOS’ are supported in RNASeqWorkflow package. ‘Windows’ is not supported. (Because ‘StringTie’ and ‘HISAT2’ are not available for ‘Windows’)
Third-party softwares used in this package include ‘HISAT2’, ‘StringTie’ and ‘Gffcomapre’. The availability of these commands will be checked by R system2() through R shell at the end of ‘Environmnet Setup’ step. Users must successfully setup environment before running the following RNA-Seq analysis. By default, binaries will be installed base on the operating system of the workstation. No compilation is needed. However, users can still choose to skip certain software binaries. More details please refer to ‘Environment Setup’ chapter.(?????????????????)
Recommended:
Python: Python2 or Python3.
2to3: If the Python version in workstation is 3, this command will be used. Generally, 2to3 is available if Python3 is available.
Python and 2to3 are used for creating raw reads count for DESeq2 and edgeR.
2to3 command available.If one of these conditions meets, raw reads count will be created and DESeq2, edgeR will be run automatically by default in ‘Gene-level Differential Analyses’ step. If not, DESeq2 and edgeR will be skipped during ‘Gene-level Differential Analyses’ step. Checking Python version and 2to3 command in workstation beforehands are highly recommended but not necessary.
‘HISAT2’ indexex: Users are advised to provide ‘indices/’ directory in ‘inputfiles/’. ‘HISAT2’ requires at least 160 GB of RAM and several hours to index the entire human genome.
source("http://bioconductor.org/biocLite.R") # Sources the biocLite.R installation script
biocLite("RNASeqWorkflow") # Installs RNASeqWorkflow
biocLite("RNASeqWorkflowData") # Installs RNASeqWorkflowDataThis is the preparation step of RNA-Seq workflow in this package. Prior to conducting RNA-Seq analysis, it is necessary to implement a constructor function, called RNASeqWorkFlowParam() and create a RNASeqWorkFlowParam S4 object which stores parameters not only for pre-checking but also for utilizing as input parameters in the following analyses.
RNASeqWorkFlowParam Slots ExplanationThere are 11 slots in RNASeqWorkFlowParam:
os.type : The operating system type. Value is linux or osx. This package only support ‘Linux’ and ‘macOS’ (no ‘Windows’). If other operating system is detected, ERROR will be reported.
python.variable : Python-related variable. Value is a list of whether Python is available and Python version (TRUE or FALSE, 2 or 3).
python.2to3 : Availability of 2to3 command. Value is TRUE or FALSE.
path.prefix : Path prefix of ‘gene_data/’, ‘RNASeq_bin/’, ‘RNASeq_results/’, ‘Rscript/’ and ‘Rscript_out/’ directories. It is recommended to create a new directory with out any file inside and all the following RNA-Seq results will be installed in it.
input.path.prefix : Path prefix of ‘input_files/’ directory. User have to prepare an ‘input_file/’ directory with the following rules:
genome.name.fa: reference genome in FASTA file formation.
genome.name.gtf: gene annotation in GTF file formation.
raw_fastq.gz/: directory storing FASTQ files.
Support paired-end reads files only.
Names of paired-end FASTQ files : ’sample.pattern_1.fastq.gz’ and ’sample.pattern_2.fastq.gz’. sample.pattern must be distinct for each sample.
phenodata.csv: information about RNA-Seq experiment design.
First column : Distinct ids for each sample. Value of each sample of this column must match sample.pattern in FASTQ files in ‘raw_fastq.gz/’. Column names must be ids.
Second column : independent variable for the RNA-Seq experiment. Value of each sample of this column can only be parameter case.group and control.group. Column name is parameter independent.variable.
indices/ : directory storing HT2 indices files for HISAT2 alignment tool.
This directory is optional. HT2 indices files corresponding to target reference genome can be installed at HISAT2 official website. Providing HT2 files can accelerate the subsequent steps. It is highly advised to install HT2 files.
If HT2 index files are not provided, ‘input_files/indices/’ directory should be deleted.
genome.name : Variable of genome name defined in this RNA-Seq workflow (ex. ‘genome.name.fa’, ‘genome.name.gtf’)
sample.pattern : Regular expression of paired-end fastq.gz files under ‘input_files/raw_fastq.gz/’. IMPORTANT!! Expression shouldn’t have _[1,2].fastq.gz in end.
independent.variable: Independent variable for the biological experiment design of two-group RNA-Seq workflow.
case.group : Group name of the case group.
control.group : Group name of the control group.
indices.optional : logical value whether ‘input_files/indices/’ is exit. Value is TRUE or FALSE
RNASeqWorkFlowParam Constructor Checking StepsCreate a new directory for RNA-Seq analysis. It is highly recommended to create a new directory without any files inside. The parameter path.prefix of RNASeqWorkFlowParam() constructor should be the absolute path of this new directory. All the RNA-Seq related files that generated in the following steps will be stored inside of this directory.
Create valid ‘input_files/’ directory. You should create a file directory named ‘input_files/’ with neccessary files inside. It should follow the rules metioned above.
RNASeqWorkFlowParam S4 object. This constructor will check the validity of input parameters before creating S4 objects.
Operating system
Python version
2to3 command
Structure, contents and rules of ‘inputfiles/’
Validity of input parameters
library(RNASeqWorkflow)
library(RNASeqWorkflowData)input_files.path <- system.file("extdata/", package = "RNASeqWorkflowData")
rnaseq_result.path <- "/tmp/yeast_example/"
dir.create(rnaseq_result.path)
list.files(input_files.path, recursive = TRUE)## [1] "input_files/Saccharomyces_cerevisiae_XV_Ensembl.fa"
## [2] "input_files/Saccharomyces_cerevisiae_XV_Ensembl.gtf"
## [3] "input_files/phenodata.csv"
## [4] "input_files/raw_fastq.gz/SRR3396381_XV_1.fastq.gz"
## [5] "input_files/raw_fastq.gz/SRR3396381_XV_2.fastq.gz"
## [6] "input_files/raw_fastq.gz/SRR3396382_XV_1.fastq.gz"
## [7] "input_files/raw_fastq.gz/SRR3396382_XV_2.fastq.gz"
## [8] "input_files/raw_fastq.gz/SRR3396384_XV_1.fastq.gz"
## [9] "input_files/raw_fastq.gz/SRR3396384_XV_2.fastq.gz"
## [10] "input_files/raw_fastq.gz/SRR3396385_XV_1.fastq.gz"
## [11] "input_files/raw_fastq.gz/SRR3396385_XV_2.fastq.gz"
## [12] "input_files/raw_fastq.gz/SRR3396386_XV_1.fastq.gz"
## [13] "input_files/raw_fastq.gz/SRR3396386_XV_2.fastq.gz"
## [14] "input_files/raw_fastq.gz/SRR3396387_XV_1.fastq.gz"
## [15] "input_files/raw_fastq.gz/SRR3396387_XV_2.fastq.gz"
Check the files in ‘inputfiles/’ directory.
exp <- RNASeqWorkflowParam(path.prefix = rnaseq_result.path,
input.path.prefix = input_files.path,
genome.name = "Saccharomyces_cerevisiae_XV_Ensembl",
sample.pattern = "SRR[0-9]*_XV",
independent.variable = "state",
case.group = "60mins_ID20_amphotericin_B",
control.group = "60mins_ID20_control")In this example, RNASeqWorkFlowParam S4 object is store in exp for subsequent RNA-Seq analysis steps. Any ERROR occured in checking steps will terminate the program.
This is the first necessary step of RNA-Seq workflow in this package. To set up the environment, run RNASeqEnvironmentSet_CMD() to execute process in background or running RNASeqEnvironmentSet() to execute process in R shell.
path.prefix directory. Here are the usage of these five main directories:
‘gene_data/’: Symblic links of ‘input_files/’ and files that are created in each step of RNA-Seq analysis will be stored in this directory.
‘RNASeq_bin/’: The binaries of necessary tools, HISAT2, SAMtools, StringTie and Gffcompare, are installed in this directory.
‘RNASeq_results’: The RNA-Seq results, for example, alignment results, quality assessment results, differential analysis results etc., will be stored in this directory.
‘Rscript’: If your run XXX_CMD() function, corresponding R script(XXX.R) for certain step will be created in the directory.
‘Rscript_out’: The corresponding output report for R script(XXX.Rout) will be stored in this directory.
path.prefix directory.The operating system of your workstation will be detected. If the operating system were not Linux and macOS, ERROR would be reported. Users can decide whether the installation of essential programs(HISAT2, StringTie and Gffcompare) are going be automatically processed.
Third-party softwares used in this package include ‘HISAT2’, ‘StringTie’ and ‘Gffcomapre’. Binaries are all available for these three softwares, and by default, they will be installed base on the operating system of the workstation automatically. Zipped binaries will be unpacked, and exported to R environment PATH. No compilation is needed.
To specify, there are three parameters(install.hisat2, install.stringtie and install.gffcompare) in both RNASeqEnvironmentSet_CMD() and RNASeqEnvironmentSet() functions for users to determine which software is going to be installed automatically or to be skipped. The default settings of these parameters are TRUE so that these three programs will be installed directly. Otherwise, users can skip certain software installation process by turning the values to FALSE. Please make sure to check the skipped programs are available by system2() through R shell. Any unavailability of each program will cause fail in ‘Environment Setup’ step.
Here are the version information of each software binary.
hisat2-2.1.0-Linux_x86_64.zip or hisat2-2.1.0-OSX_x86_64.zip zipped file will be installed.stringtie-1.3.4d.Linux_x86_64.tar.gz or stringtie-1.3.4d.Linux_x86_64 zipped file will be installed.gffcompare-0.10.4.Linux_x86_64.tar.gz or gffcompare-0.10.4.Linux_x86_64.tar.gz zipped file will be installed.‘RNASeq_bin/’ will be added to R environment PATH so that these binaries can be found in R environment in R shell through system2(). In the last step of environment setting, hisat2 --version,stringtie --version,gffcompare --version,samtools --version commands will be checked in order to make sure the environment is correctly constructed. Environment must be setup successfully before the following analyses.
Run RNASeqEnvironmentSet_CMD() or RNASeqEnvironmentSet().
RNASeqEnvironmentSet_CMD(exp)RNASeqEnvironmentSet(exp@path.prefix,
exp@input.path.prefix,
exp@genome.name,
exp@sample.pattern,
exp@indices.optional,
exp@os.type)FASTQ sequence dataThis is an optional step of RNA-Seq workflow in this package. However, it is strongly recommended before processing the alignment step. To evaluate the quality of raw reads in FASTQ files, it can be achieved by running RNASeqQualityAssessment_CMD() to execute process in background or running RNASeqQualityAssessment() to execute process in R shell.
In this step, systemPipeR package is used for evaluating sequencing reads and the details are as follows:
Check the number of times that user has run quality assessment process and create the corresponding files ‘RNASeq_results/QA_results/QA_{times}’.
RNA-Seq environment set up. ‘rnaseq/’ directory will be created by systemPipeR package.
Create ‘data.list.txt’ file.
Reading FASTQ files and create ‘fastqReport.pdf’ as the report result of quality assessment
Remove ‘rnaseq/’ directory.
This quality assessment result is generated by systemPipeR package. It will be stored as PDF.
Run RNASeqQualityAssessment_CMD() or RNASeqQualityAssessment().
RNASeqQualityAssessment_CMD(exp)RNASeqQualityAssessment(exp@path.prefix,
exp@input.path.prefix,
exp@sample.pattern)FASTQ sequence dataThis is also an optional step of RNA-Seq workflow in this package. However, it is recommended when the bad quality of sequencing reads exist. To trim the raw reads in FASTQ files, users can either run RNASeqQualityTrimming_CMD() to execute process in background or run RNASeqQualityTrimming() to execute process in R shell. After finishing quality trimming, it is highly suggested to re-conduct quality assessment. Different quality assessment result will be stored in different directories with index of processing order.
In this step, ShortRead package is ultilized for trimming the bad quality ends off the reads and the details are as follows:
FASTQ files with sample.pattern in ‘gene_data/raw_fastq.gz’ will be listed and be check. This trimming function only support paired-end FASTQ files.
The probability of error per each base pair is calculated by Phred Quality in FASTQ files and the cumulative probability of error of each base pair will also be calculated. The threshold cumulative probability cum.error, with default value 1, is used to find the trimming point of paired end files.
reads.length.limit, with default value 36, determines the minimum base pair length of reads. Length of reads less than reads.length.limit will be filtered out.
Untrimmed original FASTQ files are moved to ‘gene_data/raw_fastq.gz/original_untrimmed_fastq.gz’, and trimmed reads are written to ‘gene_data/raw_fastq.gz/’
It is strongly recommended to run RNASeqQualityAssessment_CMD() or RNASeqQualityAssessment() for quality assessment after quality trimming.
Run RNASeqQualityTrimming_CMD() or RNASeqQualityTrimming().
RNASeqQualityTrimming_CMD(exp)RNASeqQualityTrimming(exp@path.prefix,
exp@sample.pattern)This is a second necessary step of RNA-Seq workflow in this package. To process raw reads FASTQ files, users can either run RNASeqRawReadProcess_CMD() to execute process in background or run RNASeqRawReadProcess() to execute process in R shell. For further details about commands and parameters that executed during each step, please check the reported ‘RNASeq_results/COMMAND.txt’.
In preparation step(RNASeqWorkFlowParam creation step), ‘indices/’ directory is checked whether HT2 indices files already exist. If not, the following commands will be executed:
Input: ‘genome.name.gtf’, ‘genome.name.fa’
Output: ‘genome.name.ss’, ‘genome.name.exon’, ’genome.name_tran.{number}.ht2’
extract_splice_sites.py, extract_exons.py execution
hisat2-build index creation
genome.name.ss and genome.name.exon created in the previous step. Be aware that index building step requires a larger amount of memory and longer time than steps, and it might not be possible to run on some personal workstations. It is highly recommended to check the availibility of HT2 indices files at HISAT2 official website for your target reference genome beforehands. Install HT2 indices files will greatly shorten the analysis time.Input: ’genome.name_tran.{number}.ht2’, ‘sample.pattern.fastq.gz’
Output: ‘sample.pattern.sam’
hisat2 command is executed on paired end FASTQ files. SAM files will be created.
SAM files are stored in ‘gene_data/raw_bam/’.SAM to BAMInput: ‘sample.pattern.sam’
Output: ‘sample.pattern.bam’
samtools in R environment. In this step, SAM files from HISAT2 will be converted to BAM files by running asBam() function.
BAM files are stored in ‘gene_data/raw_sam/’.Input: ‘genome.name.gtf’, ‘sample.pattern.bam’
Output: ‘sample.pattern.gtf’
stringtie command is executed.
GTF files which are from each FASTQ files are stored in ‘gene_data/raw_gtf/’GTFInput: ‘sample.pattern.gtf’
Output: ‘stringtiemerged.gtf’, ‘mergelist.txt’
stringtie command is executed.
sample.pattern.gtf into stringtiemerged.gtfInput: ‘genome.name.gtf’, ‘stringtie_merged.gtf’
Output: ‘merged.annotated.gtf’, ‘merged.loci’, ‘merged.stats’, ‘merged.stringtie_merged.gtf.refmap’, ‘merged.stringtie_merged.gtf.tmap’, ‘merged.tracking’
gffcompare command is executed.
GTF file and reference annotation file is reported under ‘merged/’ directory.Input: ‘stringtie_merged.gtf’
Output: ‘ballgown/’, ‘gene_abundance/’
stringtie command is executed.
TSV file by each sample name in ‘gene_data/gene_abundance/’.Whether this step is executed depends on the availability of Python on your workstation.
Input: ‘samplelst.txt’
Output: ‘gene_count_matrix.csv’, ‘transcript_count_matrix.csv’
Reads count table converter Python script is downloaded as prepDE.py
If Python is not available, this step is skipped.
If Python2 is available, prepDE.py is executed.
If Python3 is available, 2to3 command will be checked.(Usally, if Python3 is installed, 2to3 command will be installed too.)
If Python3 is available but 2to3 command is unavailable, raw reads count generation step will be skipped.
If Python3 and 2to3 command is available, prepDE.py is converted to file that can be executed by Python2 and be executed.
Run RNASeqReadProcess_CMD() or RNASeqReadProcess().
RNASeqReadProcess_CMD(exp)python.variable <- "@"(exp, python.variable)
python.variable.answer <- python.variable$check.answer
python.variable.version <- python.variable$python.version
RNASeqReadProcess(exp@path.prefix,
exp@input.path.prefix,
exp@genome.name,
exp@sample.pattern,
python.variable.answer,
python.variable.version,
exp@python.2to3,
num.parallel.threads = 10,
exp@indices.optional)This is the third necessary step of RNA-Seq workflow in this package. To do differential expression analysis, run RNASeqDifferentialAnalysis_CMD() to execute process in background or run RNASeqDifferentialAnalysis() to execute process in R shell. In RNASeqWorkflow package, we provide four ways to do differential expression analysis: ballgown, TPM & Student t-test, DESeq2 and edgeR. Gene ids reported by StringTie(raw reads count) and ballgown(result) will be mapped to ‘gene_name’ provided in GTF file. ‘gene_name’ will be used to do gene functional analysis and pathway analysis in the next step.(!!!Functional 還有一些問題~ 這週要和學姊討論一下!!!)
If you run RNASeqRawReadProcess_CMD() in the previous step, ‘Rscript_out/Read_Process.Rout’ will be created and in the beginning of this step, alignment results, aCSV file and a PNG file, will be reported based on ‘Rscript_out/Read_Process.Rout’ file.
‘ballgown’, ‘TPM & Student t-test’, ‘DESeq2’ and ‘edgeR’ will be processed in order in this step. After doing differential expression analysis, results of each method will be visualized and stored in ‘RNASeq_results/’. The pictures showed in the following steps are the visualization result of data in RNASeqWorkflowData experiment package. The detail steps and the method-specific plots in each method will be discussed sperately later. Here we focus on the some general plots that will be created and stored in the directories of these four differential analysis tools. We only show DESeq2 result plots as an example. The following are the general plots(DESeq2):
Frequency Plot: ‘Frequency_Plot_normalized_count_ggplot2.png’
The frequency of normalized reads count(In DESeq2 is MRN). The range of x axis(MRN) is from 0% of data minus 20(smallest MRN values - 20) to 90% of data plus 20. It is drawn by ggplot2.
Frequency Plot: ‘Frequency_Plot_log_normalized_count_ggplot2.png’
The frequency of log base two of normalized reads count plus one(in DESeq2 is log2(MRN+1)). The range of x axis(log2(MRN+1)) is from 0% of data minus 5(smallest log2(MRN+1) - 5) to 99% of data plus 5. It is drawn by ggplot2.
Distribution Plots: ‘Box_Plot_ggplot2.png’
The distribution of log base two of normalized reads count plus one(in DESeq2 is log2(MRN+1)) depicting groups of numerical . Case is on left-hand side in turquoise and control is on right-hand side in yellow. It is drawn by ggplot2.
Distribution Plots: ‘Violin_Plot_ggplot2.png’
The distribution of log base two of normalized reads count plus one(in DESeq2 is log2(MRN+1)) of each sample. Case is on left-hand side in green and control is on right-hand side in yellow. It is drawn by ggplot2.
In principal component analysis(PCA) test before differential expression analysis, we use FactoMineR package to calculate principal component scores and factoextra to extract and visualize the results of principal component scores. Three plots are provided.
PCA Plots: ‘Dimension_PCA_Plot_factoextra.png’
Percentage of explained variances in top five dimensions are plotted. The top two percentage of explained variance will be used to plot PCA plot.
PCA Plots: ‘PCA_Plot_factoextra.png’
This PCA plot is drawn by fviz_pca_ind() function in factoextra. Case group is in green and control group is in yellow. The smaller points represent the individual samples and the bigger points represent the average of case group samples and control group samples. Ellipses will be added for case and control group samples.
PCA Plots: ‘PCA_Plot_ggplot2.png’
This PCA plot is drawn directly by ggplot2 with principal component scores calculated by FactoMineR package. Case group is in green and control group is in yellow.
Correlation Plot: ‘Correlation_Heat_Plot_ggplot2.png’
Correlations between each pair of samples is calculated by using R Stats package and drawn by ggplot2. Maximum correlation(1) is in red and minimum correlation in all sample pairs is in blue.
Correlation Plot: ‘Correlation_Dot_Plot_corrplot.png’
Correlation dot plot is drawn by corrplot() function in corrplot package. Maximum correlation(1) is in red and minimum correlation in all sample pairs is in blue.
Correlation Plot: ‘Correlation_Bar_Plot_PerformanceAnalytics.png’
Correlation bar plot is drawn by chart.Correlation() function in PerformanceAnalytics package.
Volcano Plot: ‘Volcano_Plot_graphics.png’
Volcano plot is drawn by ggplot2. X axis is log base two of fold change(log2(fold change)); Y axis is negative log base ten of p-value(log10(p-value)). The upregulated and downregulated genes are colored with red and green.
After differential expression(DE) analysis, principal component analysis is done again. Same as pre-DE PCA, we use FactoMineR package to calculate principal component scores and factoextra to extract and visualize the results of principal component scores. Three plots are provided.
Heatmap Plot: ‘Heatmap_Plot_pheatmap.png’
Heatmap is drawn by pheatmap() function in pheatmap package. Log base 2 of normalized count plus one(log2(FPKM+1) for ballgown) is calculated first. All calculated sample values substract mean of case group sample values(log2(case_mean_FPKM+1) - log2(each_FPKM+1) ) will be used to drawn this heatamp. Case group samples are on the left in green and control group samples are on the right in yellow.
Ballgown is designed to simplify differential expression analysis of RNA-Seq data(more information in ballgown github). The default statistical model is a parametric F-test comparing nested linear models. Normalized row count in ballgown is represented as FPKM values. The following are ballgown analysis steps:
ballgown object is created and written in ‘RNASeqresults/ballgownanalysis/ballgownRobject/ballgown.rda’.
The row sum of sample FPKM values equals 0 will be filter out. Moreover, row of extracted statistic results from created ballgown object with pval equals 0 or qval equals 0 will be filterd out too. Log base two fold change(log2(fold change)) value will be calculated as column name log2FC. According to the ‘gene_data/phenodata.csv’ file, normalized count will be seperated into case and control group with column name sample.pattern.FPKM.
A report CSV file combined with normalized FPKM count, average of case group, control group and statistic results will be stored as ‘RNASeq_results/ballgown_analysis/ballgown_normalized_result.csv’.
Differential expressed gene result ‘RNASeq_results/ballgown_analysis/ballgown_normalized_DE_result.csv’ is reported. Default selecting threshold is pval < 0.05 and log2FC > 1 | log2FC < 1.
The plot visualizes the differences between case and control by transforming the data onto X axis log2FC and Y axis log2(FPKM.mean). It is drawn by ggplot2 package.
Normalized count is represented as TPM values. The statistic model used here is Student’s t-test. The following are analysis steps:
Calculate TPM values from ballgown FPKM values and save with column name sample.pattern.FPKM FPKM values.
Use Student’s t-test to calculate p-value and saved as column name pval.
Calculate fold change values as dividing average TPM values of control group plus one by average TPM values of case group plus one((Average_case_TPM+1) / (Average_control_TPM+1) ) with column name fc. Log base two fold change(log2(fold change)) is also calculated and stored as log2FC.
A report CSV file combined with normalized TPM count, average of case group, control group and statistic results will be stored as ‘RNASeq_results/TPM_analysis/TPM_normalized_result.csv’.
Differential expressed gene result ‘RNASeq_results/TPM_analysis/TPM_normalized_DE_result.csv’ is reported. Default selecting threshold is pval < 0.05 and log2FC > 1 | log2FC < 1
Visualization. The general plots mentioned above and the following plot will be drawn.
MA Plot
The plot visualizes the differences between case and control by transforming the data onto X axis log2FC and Y axis log2(TPM.mean). It is drawn by ggplot2 package.
DESeq2 estimate variance-mean dependence in raw reads count( more information inDESeq2 document). The statistic model for differential expression is using negative binomial distribution. DESeq2 package takes sequence depth and RNA composition into consideration and use median of ratios normalization(MRN) method to reads count normalization (related site). The follwing are the analysis steps.
o will be filtered out)Create DESeqDataSet object from raw reads count of genes by running DESeqDataSetFromMatrix() function in DESeq2.
Run DESeq2() function to do differential expression analysis.
A report CSV file combined with normalized MRN count, average of case group, control group and statistic results will be stored as ‘RNASeq_results/DESeq2_analysis/DESeq2_normalized_result.csv’.
Differential expressed gene result ‘RNASeq_results/DESeq2_analysis/DESeq2_normalized_DE_result.csv’ is reported. Default selecting threshold is pval < 0.05 and log2FC > 1 | log2FC < 1
Visualization. The general plots mentioned above and the following plots will be drawn.
Dispersion Plot
DESeq2 provides a useful function, plotDispEsts(), to plot the dispersion estimates. X axis is mean of normalized counts; Y axis is dispersion.
MA Plot
The plot visualizes the differences between case and control by transforming the data onto X axis mean of normalized counts and Y axis log fold change. It is created by plotMA() function in DESeq2.
edgeR, a tool for differential expression analysis, implements a range of statistical method based on negative binomial distribution. edgeR package normalizes for RNA composition by finding a set of scaling factors for the library sizes, which is computed by trimmed mean of M-values(TMM). Normalized count is represented as cpm with library size scaled by TMM (related site). The following are analysis steps:
o will be filtered out)Create DEGList object from raw reads count by running DGEList() function.
Normalize DEGList object with TMM by running calcNormFactors() function.
Run estimateCommonDisp() to maximize the negative binomial conditional common likelihood and then run estimateTagwiseDisp() to estimate tagwise dispersion values by an empirical Bayes method.
Compute genewise exact tests and get statistic results by running exactTest().
Get normalized count by running cpm() on TMM scaled DEGList object.
A report CSV file combined with normalized TMM count, average of case group, control group and statistic results will be stored as ‘RNASeq_results/edgeR_analysis/edgeR_normalized_result.csv’.
Differential expressed gene result ‘RNASeq_results/edgeR_analysis/edgeR_normalized_DE_result.csv’ is reported. Default selecting threshold is pval < 0.05 and log2FC > 1 | log2FC < 1
MeanVar Plot
edgeR provides a useful function plotMeanVar() to visualize the mean-variance relationship in DEGList.
BCV Plot
edgeR provides a useful function plotBCV() to plot the genewise biological coefficient of variance(BCV) against gene abundace.
MDS Plot
edgeR provides a useful function plotMDS.DGEList() to plot samples on a 2D scatterplot with distances representing expression differences between the samples.
Smear Plot
edgeR provides a useful function plotSmear() to plot log base 2 fold change(log2(fold change))) against the log-concentration.
Run RNASeqDifferentialAnalysis_CMD() or RNASeqDifferentialAnalysis().
RNASeqDifferentialAnalysis_CMD(exp)RNASeqDifferentialAnalysis(exp@path.prefix,
exp@genome.name,
exp@sample.pattern,
exp@independent.variable,
exp@case.group,
exp@control.group,
ballgown.pval = 0.05,
ballgown.log2FC = 1,
TPM.pval = 0.05,
TPM.log2FC = 1,
DESeq2.pval = 0.1,
DESeq2.log2FC = 1,
edgeR.pval = 0.05,
edgeR.log2FC = 1)This is the third optional step of RNA-Seq workflow in this package. clusterProfiler is used in this step. Gene Ontology(GO) functional analysis and Kyoto Encyclopedia of Genes and Genomes(KEGG) pathway analysis are made based on the differential expressed genes(DEG) that are found in four different differential analyses, ballgown, TPM&t-test, DESeq2 and edgeR. Run RNASeqGoKegg_CMD() to execute process in background or run RNASeqGoKegg() to execute process in R shell.
Users have to provide gene name type, input.TYPE.ID, that used in StringTie, ballgown and supported in OrgDb.species annotation package for target species. In GO functional analysis and KEGG pathway analysis, input.TYPE.ID ID type will be converted into ENTREZID ID type by bitr() function in clusterProfiler first. Those input.TYPE.ID with no corresponding ENTREZID will return NA and be filtered out. The genes with Inf or -Inf log2 fold change will be filtered out too. ID conversion will be done in each differential analysis tools, ballgown, TPM & Student’s t-test, DESeq2 and edgeR.
In this example, the RNA-Seq analysis target species is Saccharomyces cerevisiae(yeast). The OrgDb.species is org.Sc.sgd.db; the input.TYPE.ID is GENENAME. IDs are converted from GENENAME to ENTREZID.
Gene Ontology is the framework for the model of biology. It defines the universe of concepts relating to gene functions(GO terms) along three aspects: molecular function(MF), cellular component(CC), biological process(BP), and how these functions are related to each other(relation) (GO website). In this process, GO gene set enrichment analysis, GO classification and GO over-representation test are implemented by clusterProfiler package.
The input gene set is differential genes that are reported in each differential expression analysis tools. groupGO() function in clusterProfiler is used. Classification result will be stored in CSV. The result of classification will be visualized. 15 highest Counts functional description will be selected and plotted as bar plot.
In this example, only the DESeq2 CC GO Classification bar plot is showed.
The input gene set is differential genes that are reported in each differential analysis tools. enrichGO() function in clusterProfiler is used. GO over-representation result will be stored in CSV. The result of GO over-representation will be visualized as bar plot and dot plot.
In this example, only the DESeq2 MF GO Over-representation bar plot and DESeq2 MF GO Over-representation dot plot are showed.
Kyoto Encyclopedia of Genes and Genomes(KEGG) is a database resource for understanding functions and utilities of the biological system from molecular-level information (KEGG website). In this process, KEGG gene set enrichment analysis, KEGG over-representation test are implemented by clusterProfiler package.
The input gene set is differential genes that are reported in each differential analysis tools. enrichKEGG() function in clusterProfiler is used. KEGG over-representation result will be stored in CSV. The pathway IDs that found in of KEGG over-representation will be visualized with pathview package. KEGG pathway URL will also be stored.
In this example, due to the limited differential expressed genes, no over-represented pathways are found.
RNASeqGoKegg_CMD(exp,
OrgDb.species = "org.Sc.sgd.db",
go.level = 3,
input.TYPE.ID = "GENENAME",
KEGG.organism = "sce")RNASeqGoKegg(exp@path.prefix,
exp@independent.variable,
OrgDb.species = "org.Sc.sgd.db",
go.level = 3,
input.TYPE.ID = "GENENAME",
KEGG.organism = "sce")sessionInfo()## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 17.10
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /home/kuan-hao/anaconda3/lib/libmkl_rt.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid parallel stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] RNASeqWorkflowData_0.1.0 png_0.1-7
## [3] BiocStyle_2.9.6 RNASeqWorkflow_0.99.0
## [5] edgeR_3.23.3 limma_3.37.4
## [7] pathview_1.21.2 org.Hs.eg.db_3.6.0
## [9] AnnotationDbi_1.43.1 IRanges_2.15.17
## [11] S4Vectors_0.19.19 Biobase_2.41.2
## [13] BiocGenerics_0.27.1 ggplot2_3.0.0
##
## loaded via a namespace (and not attached):
## [1] reticulate_1.10 tidyselect_0.2.4
## [3] RSQLite_2.1.1 htmlwidgets_1.2
## [5] FactoMineR_1.41 BiocParallel_1.15.11
## [7] devtools_1.13.6 BatchJobs_1.7
## [9] munsell_0.5.0 units_0.6-0
## [11] systemPipeR_1.15.3 withr_2.1.2
## [13] colorspace_1.3-2 GOSemSim_2.7.1
## [15] Category_2.47.1 knitr_1.20
## [17] rstudioapi_0.7 leaps_3.0
## [19] DOSE_3.7.1 KEGGgraph_1.41.0
## [21] urltools_1.7.1 org.Rn.eg.db_3.6.0
## [23] GenomeInfoDbData_1.1.0 hwriter_1.3.2
## [25] bit64_0.9-7 pheatmap_1.0.10
## [27] rprojroot_1.3-2 xfun_0.3
## [29] R6_2.2.2 GenomeInfoDb_1.17.1
## [31] locfit_1.5-9.1 bitops_1.0-6
## [33] fgsea_1.7.1 gridGraphics_0.3-0
## [35] DelayedArray_0.7.39 assertthat_0.2.0
## [37] scales_1.0.0 ggraph_1.0.2
## [39] nnet_7.3-12 enrichplot_1.1.7
## [41] gtable_0.2.0 org.Mm.eg.db_3.6.0
## [43] ballgown_2.13.1 sva_3.29.1
## [45] systemPipeRdata_1.9.4 rlang_0.2.2
## [47] genefilter_1.63.2 BBmisc_1.11
## [49] scatterplot3d_0.3-41 splines_3.5.0
## [51] rtracklayer_1.41.5 lazyeval_0.2.1
## [53] acepack_1.4.1 europepmc_0.3
## [55] brew_1.0-6 checkmate_1.8.5
## [57] BiocManager_1.30.2 yaml_2.2.0
## [59] reshape2_1.4.3 GenomicFeatures_1.33.2
## [61] backports_1.1.2 rafalib_1.0.0
## [63] qvalue_2.13.0 Hmisc_4.1-1
## [65] clusterProfiler_3.9.2 RBGL_1.57.0
## [67] tools_3.5.0 bookdown_0.7
## [69] ggplotify_0.0.3 RColorBrewer_1.1-2
## [71] ggridges_0.5.0 Rcpp_0.12.18
## [73] plyr_1.8.4 base64enc_0.1-3
## [75] progress_1.2.0 zlibbioc_1.27.0
## [77] purrr_0.2.5 RCurl_1.95-4.11
## [79] prettyunits_1.0.2 rpart_4.1-13
## [81] viridis_0.5.1 cowplot_0.9.3
## [83] zoo_1.8-3 SummarizedExperiment_1.11.6
## [85] ggrepel_0.8.0 cluster_2.0.7-1
## [87] factoextra_1.0.5 magrittr_1.5
## [89] data.table_1.11.4 DO.db_2.9
## [91] triebeard_0.3.0 matrixStats_0.54.0
## [93] evaluate_0.11 hms_0.4.2
## [95] xtable_1.8-3 XML_3.98-1.16
## [97] gridExtra_2.3 compiler_3.5.0
## [99] biomaRt_2.37.6 tibble_1.4.2
## [101] crayon_1.3.4 htmltools_0.3.6
## [103] GOstats_2.47.0 mgcv_1.8-24
## [105] Formula_1.2-3 tidyr_0.8.1
## [107] geneplotter_1.59.0 sendmailR_1.2-1
## [109] DBI_1.0.0 tweenr_0.1.5
## [111] corrplot_0.85 MASS_7.3-50
## [113] PerformanceAnalytics_1.5.2 ShortRead_1.39.0
## [115] Matrix_1.2-14 quadprog_1.5-5
## [117] bindr_0.1.1 igraph_1.2.2
## [119] GenomicRanges_1.33.13 pkgconfig_2.0.2
## [121] flashClust_1.01-2 rvcheck_0.1.0
## [123] GenomicAlignments_1.17.3 foreign_0.8-71
## [125] xml2_1.2.0 roxygen2_6.1.0
## [127] annotate_1.59.1 XVector_0.21.3
## [129] AnnotationForge_1.23.3 stringr_1.3.1
## [131] digest_0.6.17 graph_1.59.0
## [133] Biostrings_2.49.1 rmarkdown_1.10
## [135] fastmatch_1.1-0 htmlTable_1.12
## [137] GSEABase_1.43.1 Rsamtools_1.33.5
## [139] commonmark_1.5 rjson_0.2.20
## [141] nlme_3.1-137 jsonlite_1.5
## [143] bindrcpp_0.2.2 desc_1.2.0
## [145] viridisLite_0.3.0 pillar_1.3.0
## [147] lattice_0.20-35 KEGGREST_1.21.1
## [149] httr_1.3.1 survival_2.42-6
## [151] GO.db_3.6.0 glue_1.3.0
## [153] xts_0.11-1 UpSetR_1.3.3
## [155] bit_1.1-14 Rgraphviz_2.25.1
## [157] ggforce_0.1.3 stringi_1.2.4
## [159] blob_1.1.1 DESeq2_1.21.20
## [161] latticeExtra_0.6-28 memoise_1.1.0
## [163] dplyr_0.7.6
https://www.rna-seqblog.com/introduction-to-rna-sequencing-and-analysis/
https://en.wikipedia.org/wiki/RNA-Seq
Pang CN et al., “Transcriptome and network analyses in Saccharomyces cerevisiae reveal that amphotericin B and lactoferrin synergy disrupt metal homeostasis and stress response.”, Sci Rep, 2017 Jan 12;7:40232
Kim D, Langmead B and Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nature Methods 2015
Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT & Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads Nature Biotechnology 2015, doi:10.1038/nbt.3122
Morgan M, Pagès H, Obenchain V, Hayden N (2018). Rsamtools: Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import. R package version 1.32.0, http://bioconductor.org/packages/release/bioc/html/Rsamtools.html.
Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., & Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods, 5(7), 621.
Backman TWH, Girke T (2016). “systemPipeR: NGS workflow and report generation environment.” BMC Bioinformatics, 17(1). doi: 10.1186/s12859-016-1241-0, https://doi.org/10.1186/s12859-016-1241-0.
Morgan M, Anders S, Lawrence M, Aboyoun P, Pagès H, Gentleman R (2009). “ShortRead: a Bioconductor package for input, quality assessment and exploration of high-throughput sequence data.” Bioinformatics, 25, 2607-2608. doi: 10.1093/bioinformatics/btp450, http://dx.doi.org10.1093/bioinformatics/btp450.
Fu J, Frazee AC, Collado-Torres L, Jaffe AE, Leek JT (2018). ballgown: Flexible, isoform-level differential expression analysis. R package version 2.12.0.
Love MI, Huber W, Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, 550. doi: 10.1186/s13059-014-0550-8.
Robinson MD, McCarthy DJ, Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics, 26(1), 139-140.
McCarthy, J. D, Chen, Yunshun, Smyth, K. G (2012). “Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.” Nucleic Acids Research, 40(10), 4288-4297.
Yu G, Wang L, Han Y, He Q (2012). “clusterProfiler: an R package for comparing biological themes among gene clusters.” OMICS: A Journal of Integrative Biology, 16(5), 284-287. doi: 10.1089/omi.2011.0118.
Pertea, M., Kim, D., Pertea, G. M., Leek, J. T., & Salzberg, S. L. (2016). Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nature protocols, 11(9), 1650.ISO 690
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.